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NOTATIONAL  CONVENTIONS 
USED  IN  THIS  THESIS 


Symbol  Meaning 

log  logarithm  with  base  2 

-=>  "implies" 

<=>  "if  and  only  if" 

=  "is  defined  to  be" 

LxJ  greatest  integer  <  x 
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CHAPTER  1 
OVERVIEW 

1.1.  Background 

Two-dimensional  arrays  of  switching  circuits  have  been  studied  for 

some  time,  beginning  notably  with  the  work  of  Hennie.   Other  early 

2 
research  in  the  area  is  reviewed  in  the  survey  article  by  Minnick. 

3       4 
Important  contributions  were  later  made  by  Kautz,   Akers,  ,  and  several 

others.   Most  of  the  work  cited  above  consists  of  theoretical  investiga- 
tions into  the  ability  of  two-dimensional  logic  networks  tc  realize 
arbitrary  combinational  functions. 

Of  more  practical  interest  have  been  the  many  schemes  proposed  which 
use  two-dimensional  iterative  arrays  to  implement  specific  computations. 
As  examples,  networks  have  been  designed  for  performing  multiplication, 

ft  7 

division,   and  square  root  extraction.    Because  of  the  modularity 
inherent  in  array  structures,  such  circuits  possess  great  advantages  in 
the  modern  world  of  LSI  technology. 

This  thesis  explores  the  idea  of  transforming  a  two-dimensional  array 
into  an  equivalent  one  which  performs  the  same  computation  but  in  less 
time,  preferably  using  no  more  hardware  than  the  original  array.   We  could 
thus  obtain  a  speed  increase  while  preserving  the  modularity  of  cellular 
design.   Ideally,  we  would  like  to  have  an  algorithm  which  automatically 
performs  this  transformation  for  a  large  class  of  two-dimensional  networks, 

1.2.  Method  1 

We  shall  consider  here  a  rectangular  network  of  the  kind  shown  in 
Figure  1.   All  modules,  or  "cells",  in  the  network  are  identical  and 
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3 
consist  of  combinational  logic  only.   Network  inputs  and  outputs  are  as 
shown.   Throughout  this  paper  we  shall  ignore  the  cell  outputs  along  the 
bottom  edge  of  the  network. 

The  array  has  n  columns  and  k  rows  of  modules  and  will  be  referred  to 
as  an  (n,k)  network.  For  convenience  we  shall  assume  throughout  this 
work  that  n  is  a  power  of  2  and  that  1  <  k  <  n.  Each  arrow  in  Figure  1 
represents  a  bundle  of  m  bit  lines  (m  >  1).  By  S  we  shall  denote  the 
set  of  2  distinct  states  such  a  bundle  can  assume.  However,  we  will 
normally  think  of  a  bundle  as  a  single  "line"  carrying  a  signal  which 
represents  an  element  of  S. 

Figure  2  shows  a  single  module.   We  shall  assume  that  the  propagation 
time  to  either  output  is  exactly  one  "unit  delay".   The  output  values  will 
be  denoted  as  shown;  we  think  of  #  and  *  as  binary  operations  defined  on 
the  set  S  (that  is,  as  mappings  from  the  Cartesian  product  S  x  S  into  S). 

We  are  concerned  mainly  with  the  propagation  delay  of  the  network  in 
Figure  1,  which  we  shall  henceforth  call  a  Method  1  network.   It  is  easy 
to  see  that  once  the  inputs  are  applied  it  takes  n  +  k  -  1  unit  delays 
before  all  the  outputs  are  valid.  We  want  to  transform  this  array  into  a 
faster  one  which  produces  the  same  outputs. 

1.3.   Proposed  New  Methods 

In  Chapter  2  such  a  transformation  is  described.   We  shall  call  this 
new  way  of  doing  the  computation  Method  2,  Method  1  referring  to  the 
original  rectangular  network.   Method  2  can  often  yield  a  striking  improve- 
ment in  performance,  and  it  has  the  advantage  of  being  applicable  to  any 
array,  regardless  of  the  operations  #  and  *.   However,  Method  2  is  faster 
than  Method  1  only  when  k  is  quite  small  compared  to  n;  otherwise,  it  is 
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Figure  2.   Single  cell  from  rectangular  network. 


5 
actually  slower.   Method  2  also  requires  more  hardware. 

In  the  search  for  a  better  transformation,  Method  3  was  developed;  it 
is  introduced  in  Chapter  3.   Most  of  this  thesis  is  devoted  to  investi- 
gating the  properties  of  Method  3,  which  is  a  superior  technique  in 
several  ways. 

First  of  all,  Method  3  is  a  purely  geometric  transformation  of  the 
rectangular  array.   That  is,  the  Method  3  network  uses  not  only  the  same 
type  of  module  as  Method  1,  but  also  the  same  number  of  them.   Only  the 
interconnection  pattern  is  different.   Yet  we  shall  show  that  for  most 
values  of  k  Method  3  is  substantially  faster  than  Method  1  and  that  in  no 
case  is  it  slower.   Moreover,  Method  3  always  outperforms  Method  2. 
Chapter  5  develops  and  discusses  these  speed  comparisons. 

Unfortunately,  Method  3  will  not  work  for  all  arrays,  since  in  general 
it  is  not  an  output-preserving  transformation.   Only  for  certain  instances 
of  the  cell  operations  #  and  *  will  the  Method  1  network  and  the  corres- 
ponding Method  3  network  perform  the  same  computation;  this  matter  is 
discussed  in  Chapter  4.   Perhaps  future  research  will  be  able  to  modify 
Method  3  so  as  to  extend  its  range  of  application.   In  any  event,  the 
many  advantages  of  Method  3  give  it  excellent  potential  as  a  design 
algorithm  for  speeding  up  computations  done  by  two-dimensional  arrays. 


CHAPTER  2 
METHOD  2 

2.1.   Construction 

To  help  explain  Method  2,  we  shall  use  as  an  example  a  (4,3) 
rectangular  network,  which  Figure  3  shows  in  detail.   Let  us  first  consider 
the  top  row  of  cells.   Viewed  by  itself,  it  is  a  one-dimensional  iterative 
array,  otherwise  known  as  a  (first-order)  recurrence  network.   There  exists 
a  general  method  for  speeding  up  such  a  computation;  it  involves  using 
semigroups  of  functions  to  obtain  a  tree   implementation  of  the  network. 

This  technique  has  been  described  recently  by  several  authors,  including 

8  9 

Gajski  and  Unger. 

However,  to  simplify  our  discussion  let  us  assume  that  the  operation 

#  is  associative.   In  that  case  we  can  derive  a  tree  implementation 

directly,  for  y  =  a..  ,  //  (a-.-  #  (a,  9  #  (a.,  #  X-)))  can  be  computed  as 

a  ,  #  ((a-  -  //  a1?)  #  (a1  -  #  x..)).   In  this  way  some  parallelism  can  be 

introduced  into  an  originally  serial  computation.   For  a  detailed 

discussion  of  tree-height  reduction  by  associativity,  see  pp.  101  -  102 

*    v      u  10 
of  Kuck. 

Anyway,  Method  2  uses  this  technique  to  transform  the  first  row  of  the 

rectangular  array  into  the  network  shown  in  Figure  4.   Figure  5  depicts 

the  two  modules  used  in  this  network;  we  assume  each  has  propagation 

time  of  one  unit  delay.   Observe  that  the  cell  in  Figure  2  is  simply 

these  two  modules  combined  into  one  box.   With  all  of  these  modules  note 

carefully  the  order  of  the  inputs,  for  #  and  *  are  not  necessarily 

commutative. 


X- 


Figure  3.   A  (4,3)  rectangular  (Method  1)  network. 
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Figure  4.   First  stage  of  (4,3)  Method  2  network. 
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Figure  5.   Modules  used  in  Method  2  network. 
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It  is  easy  to  verify  that  the  network  of  Figure  4  correctly  computes 
y1  and  the  a~.    from  the  inputs  x,  and  the  a    (1  <  i  <  4).   We  now  apply 
exactly  the  same  transformation  to  the  second  row  of  the  Method  1  network 
to  obtain  the  second  stage  of  the  Method  2  network.   This  stage  takes  as 
inputs  the  a~  .  from  the  first  stage,  along  with  x~  ;   it  computes  y  and 
the  a„..   The  third  row  of  the  rectangular  array  is  similarly  transformed 
to  complete  the  (4,3)  Method  2  network.   (Actually,  this  last  stage  can 
be  somewhat  simplified  because  the  a,,  need  not  be  computed.) 

As  another  example,  Figure  6  depicts  one  stage  of  a  Method  2  network 
for  n  =  16.   Here  solid  circles  represent  //-modules  and  open  circles 
represent  *-modules.   (In  this  more  schematic  drawing,  suggested  by 
Gajski   ,  one  can  begin  to  see  the  recursive  nature  of  this  design.   That 
is,  the  circuit  for  n  is  constructed  in  a  regular  way  from  two  circuits 

for  ■=■  .   For  more  details  on  this  matter,  see  the  paper  by  Ladner  and 

12 
Fischer   ,  especially  their  Figure  3.)   Again,  it  is  easy  to  see  that 

this  circuit  performs  the  desired  computation.   A  complete  (16, k)  Method  2 

network  consists  of  k  of  these  stages  cascaded  together. 

2.2.   A  Bound  for  Computation  Time 

This  transformation  works  for  any  (n,k)  rectangular  network. 
Basically,  what  Method  2  does  is  compute  each  output  y.  by  fanning  in 
n  +  1  lines  with  //-modules  as  fast  as  possible.   Recalling  that  n  is  a 
power  of  2,  we  see  that  each  stage  of  the  Method  2  network  can  complete 
all  of  its  computations  in  1  +  log  n  unit  delays,  once  all  of  the  stage 
inputs  have  become  available.   Thus  the  total  propagation  time  for  the 
entire  (n,k)  Method  2  network  is  no  greater  than  k(l  +  log  n)  unit  delays, 
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Figure  6.   Stage  j  of  Method  2  network  for  n  =  16. 
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because  there  are  k  stages  in  all.   We  shall  see  later  that  the  actual 
propagation  time  is  somewhat  less  than  this  upper  bound,  although  over 
the  range  of  interest  the  difference  is  not  very  great. 

Let  T  denote  the  computation  time  for  an  (n,k)  network  using  Method  1 
and  T„   denote  the  time  required  using  Method  2.   We  then  have: 

T  -  n  +  k  -  1 
T2  <  k(l  +  log  n)  E  T2 

For  the  moment  let  us  assume  that  the  upper  bound  T9  is  the  true  Method  2 
delay  time.   It  is  trivial  to  prove  the  following  result: 

n  -  1        * 

log  n         2    1 

n  -  1        A 

k  >  % -   =^   T„  >  T- 

log  n         2    1 

n  —  1      a  a 

We  shall  denote  the  crossover  point by  c  .   Note  that  c   is  generally 

log  n      n  n 

much  less  than  n;   for  example,  if  n  =  128,  c  ^18. 

Thus  Method  2  is  faster  than  Method  1  only  when  k  is  small  relative  to 

n;  otherwise,  Method  1  is  faster.   If  k  <  c  ,  then  k  «  n,  whence  the 
'  n*  ' 

speedup  (as  defined  by  Kuck  )  of  Method  2  over  Method  1  is : 


T 
1     n  +  k  -  1 


^      k(l  +  log  n) 
2 


1  /_JL\ 
k  ^ log  n  J 


When  k  =  1,  which  corresponds  to  the  rectangular  array  having  only  one 

row,  this  becomes  ■= ,  the  well-known  speedup  obtained  when  a  one- 
log  n 

dimensional  iterative  array  is  replaced  by  a  binary  tree  implementation. 
Hence,  one  can  view  —  as  an  attenuation  factor  which  arises  in  the  two- 
dimensional  case. 
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2.3.   Actual  Computation  Time 

As  mentioned  earlier,  the  true  Method  2  propagation  time  T„  is  in 
general  somewhat  smaller  than  T_  =  k(l  +  log  n) .   To  see  why  this  is  so, 
let  us  look  again  at  the  example  shown  in  Figure  6.   Although  for  the  sake 
of  regularity  we  have  arranged  the  diagram  to  show  5  rows  of  modules,  the 
rightmost  8  columns  actually  have  depth  less  than  5.   Thus  (taking  j  =  1) 
the  first  stage  of  the  Method  2  network  produces  output  y.  after  5  unit 
delays,  but  a„  ,  a„„,  a„«,  ...  ,  a~fi  are  ready  earlier.   Hence,  the  second 
stage  of  the  network  can  begin  computing  sooner  than  we  might  have  expected, 
and  many  of  its  outputs  are  ready  after  far  less  than  10  unit  delays. 

Continuing  down  through  later  stages,  this  speedup  effect  begins  to 
snowball,  aided  by  the  fact  that  most  of  the  columns  (even  on  the  left 
side)  do  not  have  modules  in  every  row.   The  effect  drifts  from  right  to 
left,  until  after  a  few  stages  the  speedup  starts  to  appear  in  the  outputs 
y..   This  means  that  effectively  the  propagation  time  through  each  of  the 
later  stages  is  less  than  5  unit  delays.   Hence,  for  the  (16, k)  network  we 
find  that  T?   <   T     =  5k,  at  least  for  the  larger  values  of  k.   Table  1  details 
the  situation  for  this  example. 

Naturally,  the  preceding  discussion  applies  regardless  of  the  value  of 
n.   In  all  cases,  the  effective  propagation  time  is  1  +  log  n  only  for  the 
first  several  (in  fact,  log  n)  stages;  thereafter,  the  stage  delay 
decreases.   Thus,  keeping  n  fixed,  the  difference  between  T_  and  T„  grows 
with  increasing  k.  As  Table  1  shows,  this  difference  can  become  quite 
large  as  k  approaches  n. 

However,  we  are  not  really  interested  in  the  larger  values  of  k  if  in 
those  cases  Method  2  is  outperformed  by  Method  1,  anyway.   What  we  would 
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k 

T 
2 

T 
2 

1 

5 

5 

2 

10 

10 

3 

15 

15 

4 

20 

20 

5 

25 

24 

6 

30 

28 

7 

35 

32 

8 

40 

35 

9 

45 

38 

10 

50 

40 

11 

55 

42 

12 

60 

44 

13 

65 

45 

14 

70 

46 

15 

75 

47 

16 

80 

48 

effective  delay  time     ~ 

of  kth  stage 2  "  2 


5  0 

5  0 

5  0 

5  0 

4  1 

4  2 

4  3 

3  5 

3  7 

2  10 

2  13 

2  16 

1  20 

1  24 

1  28 

1  32 


Table  1.   Propagation  time  (in  unit  delays) 
for  (16, k)  Method  2  network. 


15 
like  to  know  is,  for  each  n,  the  largest  value  of  k  for  which  Method  2  is 
faster  than  Method  1.   Earlier  we  calculated  this  crossover  point  to  be 

c  =  -s .   (Since  c   is  not  in  general  an  integer,  the  value  of  k  we 

n   log  n  n  °      ' 

desire  would  be  [c    J.)   But  recall  that  this  was  computed  on  the  basis  of 

T„ ,  not  Ty .   Thus  we  now  define  c  to  be  the  true  crossover  point  between 

the  two  methods.   That  is,  c  is  the  integer  such  that: 

n 

k  <  c   =>    T„  <  T, 
-  n         2    1 

k  >  c   =>    T_  >  Tn 
n         2    1 

Since  in  general  T_  <  T„,  Method  2  should  compare  more  favorably  to 

Method  1  than  we  had  thought.   In  particular,  we  would  expect  that 

c  >  Lc  J.   But  what  is  the  magnitude  of  this  improvement?   The  answer  was 

obtained  from  a  computer  simulation  of  Method  2  which  determined  T~  for 

values  of  n  up  through  1024  and  all  relevant  values  of  k.   From  this  were 

derived  values  of  c  ,  which  are  displayed  in  Table  2. 

n 

c 

Note  that  over  the  range  considered  — •* —  <  1.2,  an  improvement  of  less 

c    ic   °n 
than  20%.   More  significantly,  — LJU-  <  . 02 ;  that  is,  the  difference 

between  the  true  crossover  and  the  estimated  one  is  no  more  than  2% 

of  n.   In  addition,  the  computer  study  showed  that  for  k  <  c   (the  range 

over  which  Method  2  would  be  used)  the  difference  between  T„  and  T„  is  not 

very  great ;  the  improvement  is  never  more  than  14% ,  and  usually  it  is  much 

less.   T„  and  T„  are  especially  close  when  k  is  very  small,  which  is 

precisely  when  we  would  be  most  likely  to  use  Method  2  (because  in  this 

range  Method  2  outperforms  Method  1  by  the  greatest  amount) .   Later  in 

this  paper  Figure  8  and  Figures  16  through  19  provide  graphical  comparisons 


16 


CM 
O 

t— 1 

o 

CM 

CM 

o 

CM 

.-I 

o 

00 

rH 

00 
rH 

o 

CM 
rH 

m 

• 

rH 
rH 

00 

O 
CM 
O 

vD 
CM 

CO 

H 
m 

H 

CM 

rH 

vO 
rH 

rH 

o 

CM 

o 

00 
CM 
rH 

00 
rH 

m 
H 

• 

-3- 
rH 

O 

00 

o 

o 

H 

o 

rH 

rH 

O 
rH 

rH 

rH 

O 

CM 
00 

vO 

<D 

H 

rH 

rH 

o 

rH 

m 

CO 

rH 

o> 

rH 

rH 

o 

00 

CM 

CM 

m 

CM 

m 

CM 

rH 

o 

<r 

rH 

rH 

m 

CM 

m 

CM 

rH 

o 

C 

a 

^1 

<  O 

c 

CJ 

<  a 

<  o 
1 

u 

c 

<  o 

X) 

C 


d 
<u 
a) 
& 

4-1 

c 
o 

CO 

•H 

r« 

CO 
&. 

6 
o 

CJ 


CM 

cu 

rH 

cfl 
H 


17 

of  T    T2,  and  1^. 

In  conclusion,  we  see  that  although  Method  2  performs  better  than 

our  upper  bound  T„  would  indicate,  the  difference  is  not  particularly 

significant.   For  most  practical  purposes,  the  propagation  time  through  an 

(n,k)  Method  2  network  can  be  roughly  approximated  by  the  simple  formula 

k(l  +  log  n).   Moreover, serves  as  a  reasonable  estimate  for  the 

log  n 

crossover  point  between  Method  2  and  Method  1. 

2.4.   Required  Hardware 

Looking  at  Figures  4  and  6,  we  see  that  a  single  stage  of  an  (n,k) 
Method  2  network  will  require  exactly  n  *-modules.   Moreoever,  it  is  not  hard 
to  show  that  the  number  of  //-modules  needed  is  1  +  rn  log  n.   Since  there  are 
k  stages,  the  entire  Method  2  network  requires  nk  *-modules  and  nk(y  log  n)  +  k 
//-modules.   (We  neglect  the  fact  that  the  last  stage  can  be  simplified.) 

Of  course,  Method  1  uses  exactly  nk  cells.   These  are  equivalent  to  nk 
*-modules  and  nk  //-modules,  a  significant  savings  in  //-modules  over  Method 
2.   Moreover,  the  Method  1  modules  may  possibly  be  simplified  by  gate- 
sharing,  since  two  outputs  are  being  produced  from  the  same  set  of  inputs. 

(For  a  discussion  about  design  of  such  multiple-output  circuits,  see,  for 

13 

example,  pp.  131-139  of  McCluskey.   ).   Thus,  from  the  point  of  view  of 

pure  switching  theory,  we  see  that  Method  2  will  require  somewhat  more 
hardware  than  Method  1. 

In  reality,  this  difference  may  be  magnified  by  some  pragmatic  factors 
favoring  Method  1.   If  we  assume  one  cell  per  chip,  the  total  package 
count  is  nk  for  Method  1  versus  (1  +  y  log  n)nk  +  k  for  Method  2.   This  is 
a  quite  relevant  comparison  even  though  the  Method  2  cells  are  simpler,  for 
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these  days  the  number  of  chips  mounted  on  the  PC  boards  usually  has  a 
greater  impact  on  cost  than  does  on-chip  complexity.   In  addition,  Method 
1  uses  only  one  type  of  chip,  whereas  Method  2  requires  two  types.   This 
disparity  is  quite  meaningful,  considering  the  high  initial  cost  of 
integrated  circuit  fabrication. 

2.5.   Fan-out  and  Method  2' 


Another  problem  with  Method  2  is  the  matter  of  fan-out.   Looking  at 
Figure  6,  consider  the  lowest  //-module  in  the  column  with  input  a.7.   It 
is  required  to  drive  8  //-modules  plus  1  *-module.   Assuming  that  each 
module,  //  or  *,  presents  a  single  "unit  load",  the  //-module  in  question 
must  be  able  to  handle  a  fan-out  of  9.   This  is  the  highest  fan-out  in  the 
network;  it  is  much  lower  for  most  other  modules.   However,  since  all 
//-modules  are  presumed  to  be  identical,  they  must  be  designed  to  accomodate 
the  worst  case.   Generalizing,  it  is  easy  to  see  that  the  maximum  fan-out 
occurring  in  an  (n,k)  Method  2  network  is  y  +  1 .   If  n  is  at  all  large, 
this  power  requirement  may  well  be  unacceptable. 

To  remedy  this  situation  we  shall  now  introduce  a  variation  on  Method  2 , 
called  Method  2' ,  which  is  based  on  a  suggestion  made  by  Gajski.  "  Figure  7 
illustrates  the  new  method  for  the  case  n  =  16  and  is  to  be  compared  to 
Figure  6.   Both  circuits  produce  the  same  outputs,  a  result  which  follows 
from  our  assumption  that  the  //  operation  is  associative.   Moreover,  in 
Method  2'  we  cascade  stages  exactly  as  we  did  in  Method  2.   What  we  have 
done  in  Figure  7  is  to  insert  some  additional  //-modules  and  make  the 
interconnection  pattern  more  uniform. 

As  a  consequence,  fan-out  in  Method  2'  is  more  equally  distributed 
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Figure  7.   Stage  j  of  Method  21  network  for  n  =  16. 
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among  the  //-modules.   Indeed,  the  largest  fan-out  in  Figure  7,  handled 
by  the  rightmost  module  in  the  top  row,  is  only  4  unit  loads.   In  general 
one  can  see  that  the  maximum  fan-out  in  an  (n,k)  Method  2'  network  is 
log  n.   This  is  well  within  the  limits  of  current  technology,  even  for 
very  large  n. 

However,  Method  2'  does  entail  some  disadvantages.   The  most  obvious 
drawback  is  that  of  extra  hardware.   A  straightforward  calculation  shows 
that  the  number  of  //-modules  required  per  stage  for  Method  2'  is 
n(log  n  -  1)  +  2,  compared  torn  log  n  +  1  for  Method  2.   Hence,  the  new 
method  uses  approximately  twice  as  many  //-modules.   (Note  that  the  number 
of  ^-modules  is  the  same  for  both  methods:  exactly  n  per  stage.) 

Another  penalty  incurred  with  Method  2'  is  a  slight  decrease  in  speed. 
Comparing  Figures  6  and  7,  we  see  that  propagation  times  to  corresponding 
outputs  are  identical  (assuming  all  inputs  arrive  simultaneously.)   Yet 
when  many  stages  are  cascaded  together,  the  overall  network  delay  times  for 
the  two  methods  differ!   This  surprising  result  stems  from  the  fact  that 
Method  2'  fills  in  most  of  the  gaps  in  the  //-module  "matrix"  of  Figure  6. 
This  hinders  the  stage-by-stage  speedup  effect  described  earlier  for 
Method  2. 

However,  some  of  this  speedup  does  occur  in  Method  2' ,  developing 
roughly  as  follows.   Referring  to  Figure  7,  we  see  that  the  right  half  of 
the  circuit  computes  faster  than  1  +  log  n,  the  delay  time  for  the  left 
half.   Now  note  that  the  outputs  of  each  stage  are  fed  into  the  inputs  of 
the  next  stage  shifted  to  the  left  by  one  column.   From  this  it  follows 
that  the  speedup  effect  travels  to  the  left  by  one  column  per  stage;  it 
begins  to  appear  at  the  network  outputs  y.  only  after  the  first  y  stages. 
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Thus  for  the  first  y  stages  the  effective  stage  delay  time  is  fully 
1  +  log  n. 

Let  T?l  denote  the  computation  time  for  an  (n,k)  Method  2'  network. 

A  J} 

From  what  we  said  above,  we  have  that  T~,  =  k(l  +  log  n)  =  T„  if  k  <  y  . 

n  .  -  1     n 

But  since  c  =  ■; <  ■=-  (for  n  >  A)  ,  it  follows  that  the  crossover  point 

n   log  n   2  ' 

between  Method  1  and  Method  2'  is  simply  c  and  that  for  k  <  c  , 

n  n 

T_,  =  T2  <   T  .   That  is,  in  the  case  of  Method  2'  our  simple  upper  bound 

formulas,  time  =  k(l  +  log  n)  and  crossover  =  ■= ,  hold  exactly! 

log  n  * 

(Actually,  T~,  <  k(l  +  log  n)  when  k  >  y  ,  but  of  course  in  that  range  we 

would  never  use  Method  2',  anyway,  since  it  is  slower  than  Method  1.) 

Now  recall  from  our  study  of  Method  2  that  c  ,  the  crossover  point 

n 

between  Methods  1  and  2,  is  slightly  greater  than  c  .   Remember  also  that 
for  k  <  c  ,  T_  is  slightly  less  than  T„ .   Therefore,  over  the  range  of 
interest  Method  2  outperforms  Method  2 '  by  a  small  amount .   However ,  this 
difference  in  speed  (plus  the  lower  module  count  of  Method  2)  must  be 
weighed  against  the  considerable  fan-out  advantage  of  Method  2'. 

2.6,   Summary 

As  we  stated  earlier,  an  (n,k)  Method  2  network  is  faster  than  the 

corresponding  (n,k)  Method  1  network  only  if  k  is  small  compared  to  n. 

cn 
Note  from  Table  2  that  —  is  generally  somewhere  between  .1  and  .2  . 

n 

That  is,  as  k  varies  from  1  to  n,  only  in  the  lower  10  to  20  percent  of 
that  range  (depending  on  n)  does  Method  2  outperform  Method  1.   The 
situation  is  similar  for  Method  2f ,  as  can  be  seen  from  the  tabulated 
values  of 


n 
Figure  8  compares  the  speeds  of  Methods  1,  2,  and  2'  in  a  general  way. 
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Figure  8.   Speed  comparison  of  Methods  1,  2,  and  2' 
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Considering  n  to  be  fixed,  we  have  sketched  T  ,  T?,  T?,  and  T_,  as 
functions  of  k,  which  runs  from  1  to  n.   (For  purposes  of  presentation 
these  curves  have  been  drawn  as  continuous;  of  course,  in  reality  k  takes 
on  only  integral  values.)  Figures  16  through  19  in  Chapter  5  present  the 
same  curves  plotted  from  actual  data  for  several  values  of  n.   In  addition, 
Table  6  (in  Chapter  6)  displays  side  by  side  the  major  characteristics  of 
Methods  1,  2,  and  2' . 

To  summarize  our  results  on  Method  2  (and  its  variation,  Method  2'), 
we  may  say  that  this  technique  has  definite,  though  rather  limited, 
advantages.  Method  2  is  a  simple  and  quite  general  approach  for  achieving 
significant  speedup  in  two-dimensional  arrays.   However,  this  speedup  is 
obtained  only  when  the  relative  dimensions  of  the  array  are  in  a  certain 
narrow  range.   Thus  in  spite  of  the  fact  that  they  require  more  hardware, 
Method  2  networks  are  in  most  cases  actually  slower  than  the  original 
(Method  1)  rectangular  arrays.   It  is  this  difficulty  that  motivated  the 
development  of  Method  3. 
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CHAPTER  3 
METHOD  3:   CONSTRUCTION 

Like  Method  2,  Method  3  transforms  an  (n,k)  rectangular  (Method  1) 
network  into  a  circuit  consisting  of  k  stages  built  one  upon  the  other, 
each  producing  a  single  network  output.   In  both  cases  the  i   stage 
corresponds  to  the  i   row  of  the  rectangular  array.   However,  in  contrast 
to  Method  2,  the  modules  used  in  Method  3  are  identical  to  those  of  the 
original  network. 

3.1.   An  Example 

To  illustrate  the  new  technique,  we  shall  consider  an  (8,3)  Method  1 
network;  this  is  shown  in  Figure  9.   Figure  10  depicts  the  first  stage  of 
the  corresponding  Method  3  network.   The  modules  in  both  diagrams  are  the 
same;  however,  Figure  10  does  not  show  their  *  outputs  because  only  the  // 
outputs  are  used  within  stage  1.  What  we  have  done  in  Figure  10  is  to 
simply  fan  in  as  quickly  as  possible  the  inputs  a,  through  aR  plus  the 
input  x1  .   Output  z..  is  thus  the  //-product  of  these  nine  inputs.   This  is 
the  same  as  output  y-  in  Figure  9,  provided  that  //  is  associative. 

Now  note  that,  except  for  x„ ,  each  input  to  the  second  row  of  the 
Method  1  network  is  a  *-product  of  signals  occurring  within  the  first  row. 
Output  y„  is  then  the  //-product  of  all  of  these  inputs.   This  suggests 
that  we  construct  stage  2  of  the  Method  3  network  in  the  following  way. 
We  shall  simply  take  the  *  outputs  from  all  the  stage  1  modules  (along 
with  input  x?)  and  form  their  //-product  as  fast  as  we  can.   This  is  shown 
in  Figure  11,  where  dashed  lines  indicate  the  second  stage.   The  *  outputs 
of  the  stage  2  modules  are  not  shown,  since  they  will  not  be  needed  until 
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Figure  10.   Stage  1  of  (8,3)  Method  3  network. 
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Figure  11.   Stages  1  and  2  of  (8,3)  Method  3  network. 
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we  construct  stage  3. 

In  general  the  second  stage  output  z„  will  not  be  the  same  as  the 
output  y„  in  Figure  9.   However,  as  we  shall  see  in  Chapter  4,  there  do 
exist  cases  in  which  not  only  does  z„  equal  y„  ,  but  every  stage  of  the 
Method  3  network  produces  the  same  output  as  the  corresponding  row  of  the 
Method  1  network.   We  shall  then  say  that  the  Method  3  transformation  is 
"output-preserving".   The  existence  of  this  property,  without  which  Method 
3  is  useless,  depends  on  how  the  operations  #  and  *  behave. 

We  need  now  to  introduce  some  terminology  and  conventions.   Henceforth, 
we  shall  use  the  shorter  word  "cell"  in  place  of  "module".   In  the  Method  3 
network,  the  cells  of  stage  1  will  be  called  "1-cells",  those  of  stage  2, 
"2-cells",  and  in  general  the  cells  of  stage  i  will  be  called  "i-cells". 

The  level  numbers  listed  along  the  left  side  of  Figure  11  correspond 
to  propagation  times  within  the  Method  3  network.   That  is,  the  outputs  of 
all  cells  on  level  i  (regardless  of  to  which  stage  they  belong)  become  valid 
exactly  i  unit   delays  after  the  network  inputs  are  applied.   For  example, 
the  propagation  time  to  output  z..  is  4  unit  delays ,  whereas  z?  becomes 
valid  after  6  unit  delays. 

Because  of  the  way  our  figures  are  drawn,  we  shall  say,  for  example, 
that  level  3  is  "above"  level  5.   That  is,  higher  level  numbers  correspond 
to  "lower"  levels.   The  highest  level  (lowest  level  number)  on  which  an 
i-cell  occurs  will  be  called  the  "top"  level  of  stage  i.   For  instance, 
in  Figure  11  level  2  is  the  top  level  of  stage  2.   Similarly,  level  4  is 
the  "bottom"  level  of  stage  1.  The  number  of  levels  containing  at  least 
one  i-cell  will  be  called  the  "height"  of  stage  i.   For  example,  the 
height  of  stage  1  is  4,  whereas  stage  2  has  height  5. 
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In  constructing  stage  2,  as  well  as  all  other  stages,  no  attention  is 
paid  to  the  order  in  which  we  "//  together"  the  *  outputs  from  the 
previous  stage.  All  that  matters  is  that  no  time  is  wasted  in  performing 
the  computation.  As  soon  as  a  stage  1  *  output  becomes  available,  it  is 
immediately  used  to  compute  a  valid  //-product  in  stage  2.  Similarly,  the 
//  outputs  of  the  2-cells  are  fanned  in  to  form  z~  as  quickly  as  possible. 
In  this  way  we  obtain  a  great  deal  of  parallelism,  often  putting  a  number 
of  cells  from  several  different  stages  all  on  the  same  level. 

Actually,  some  "wasted"  time  is  unavoidable,  for  whenever  an  odd 
number  of  signals  appears  at  a  given  level,  one  of  them,  of  course,  cannot 
be  utilized  immediately  but  must  be  "reserved"  for  use  at  a  lower  level. 
For  example,  in  Figure  11  the  output  of  the  leftmost  cell  on  level  4  is 
not  used  until  level  6.   In  our  method  the  choice  of  which  signal  to 
reserve  is  made  arbitrarily,  except  that  by  convention  the  network  inputs 
x.  are  always  the  first  to  be  reserved.   For  instance,  x,  and  x» ,  though 
available  at  time  zero,  are  not  used  until  level  4. 

All  of  this  will  be  made  more  clear  when  in  the  next  section  we  present 
a  formal  algorithm  for  constructing  Method  3  networks.   Meanwhile,  let  us 
complete  our  example  by  showing  in  Figure  12  the  addition  of  stage  3, 
which  (like  stage  1)  is  indicated  by  solid  lines.   To  lessen  confusion 
we  have  labeled  each  cell  with  the  number  of  the  stage  to  which  it 
belongs.   The  inputs  to  stage  3  are  the  *  outputs  of  the  2-cells, 
together  with  the  network  input  x„.   The  //-product  of  all  of  these  is 
formed  by  the  3-cells  and  denoted  by  z    . 

Once  again  we  caution  that  in  general  z~  will  not  be  equal  to  y„  in 
Figure  9.   In  fact,  in  most  cases  the  Method  3  outputs  will  depend  on 
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Figure  12.   (8,3)  Method  3  network. 
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exactly  which  way  we  connected  the  various  cell  inputs  in  constructing 
each  stage.   However,  our  interest  lies  in  those  cases  for  which  the 
outputs  are  the  same  regardless  of  the  order  in  which  the  products  are 
formed,  these  outputs  being  equal  to  the  corresponding  Method  1  outputs. 
This  is  the  "output-preserving"  property  mentioned  earlier. 

As  examples  of  our  terminology,  observe  in  Figure  12  that  stage  3 
has  top  level  3,  bottom  level  8,  and  height  6.   Also  note  that  the  output 
of  the  leftmost  cell  on  level  5  is  reserved  until  level  8.  However,  the 
most  important  thing  to  observe  here  is  that  the  total  computation  time 
for  the  Method  3  network  is  8  unit  delays,  as  opposed  to  the  10  unit 
delays  Method  1  requires  (Method  2  takes  12). 

Note  that  Figure  11  can  be  viewed  as  either  the  first  two  stages  of 
an  (8,3)  network  or  as  a  complete  (8,2)  network.   Likewise,  the  array  in 
Figure  12  can  be  made  into  an  (8,4)  Method  3  network  by  simply  adding 
one  extra  stage.   The  *  outputs  of  the  3-cells  would  be  inputs  to  that 
stage,  which  would  be  constructed  in  the  same  way  as  the  previous  stages. 

As  a  further  illustration  of  our  procedure,  Figure  13  depicts  a 
(16,4)  Method  3  network.   Solid  lines  are  used  for  stages  1  and  3, 
dashed  lines  for  stages  2  and  4.   The  network  propagation  time  is  11 
unit  delays;  this  compares  to  19  for  Method  1  and  20  for  Method  2. 
Chapter  5  is  devoted  to  a  detailed  discussion  of  the  speedup  provided 
by  Method  3. 

3.2.   The  Algorithm 

We  now  present  a  formal  procedure  for  designing  any  (n,k)  Method  3 
array  (1  <  k  <  n,  n  a  power  of  2).   We  shall  make  use  of  the  notation  and 
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Figure  13.   (16,4)  Method  3  network. 
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terminology  introduced  in  describing  our  example.   The  algorithm 
consists  of  two  parts:   Step  A  is  performed  once,  then  Step  B  is  iterated 
k  -  1  times  (for  i  from  2  to  k) .   The  statement  of  the  algorithm  is 
followed  by  several  remarks  which  will  be  useful  later  in  the  paper. 

Algorithm  for  Constructing  an  (n,k)  Method  3  Network 

STEP  A 
To  construct  stage  1: 

1.  Recalling  that  n  is  a  power  of  2,  fan  in  the  inputs  a.,  a~ , .  .  .,  a 
with  a  full  binary  tree  of  cells,  using  only  the  #  outputs.   (The 
root  cell  of  this  tree  will  be  on  level  log  n.   See  Figure  10  for  an 
example  where  n  =  8 . ) 

2.  Install  an  additional  cell  on  level  log  n  +  1,  having  as  inputs  the 
network  input  x.  and  the  #  output  from  the  bottom  cell  of  the  above 
tree.   The  //  output  of  this  added  cell  is  the  network  output  z  . 

STEP  B 

To  construct  stage  i, 
once  stage  i  -  1  has  been  constructed: 

1.  "Reserve"  the  input  x  . 

2.  Let  L  be  the  number  of  the  top  level  of  stage  i  -  1. 

3.  Let  S  be  the  set  of  all  *  outputs  from  (i  -  l)-cells  on  level  L 
and  all  #  outputs  from  i-cells  on  level  L.   Let  C  denote  the 
cardinality  of  S. 
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4.  If  C  is  odd,  then  do  the  following: 

If  there  is  a  reserved  element,  add  it  to  S;  otherwise, 
remove  an  element  from  S  and  reserve  it. 
C  is  now  even  or  zero. 

5.  If  C  is  not  zero,  then  do  the  following: 

Q 

a.  Install  y  i-cells  on  level  L  +  1,  connecting  their  inputs  to  the 
elements  of  S  (in  any  order  whatever) . 

b.  Increment  L  by  one. 

c.  Go  to  step  3. 

6.  Now  C  =  0  and  we  are  done.   The  reserved  element  is  the  network 
output  z . . 


REMARKS  (valid  for  all  i) 

Remark  1   The  top  level  of  stage  i  is  level  i.   (This  is  obviously  true 
for  i  =  1,  and  an  easy  induction  verifies  it  for  i  >  1.) 

Remark  2   Note  that  stage  i  skips  no  levels;  that  is,  it  has  at  least 
one  cell  on  every  level  between  its  top  level  and  its  bottom  level. 

Remark  3   Clearly,  the  bottom  level  of  stage  i  is  below  the  bottom 
level  of  stage  i  -  1  and  contains  exactly  one  i-cell  (whose  //  output 
is  zi). 

Remark  4   Hence,  the  total  number  of  levels  in  the  network  is  equal  to 
the  bottom  level  number  of  stage  k.   This  is  the  propagation  time  through 
the  network  (in  unit  delays). 
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Remark  5   Since  each  i-cell  has  two  input  lines  and  one  #  output  line , 
stage  i  forms  a  regular  binary  tree.   The  branches  of  this  tree  are  the 
input  lines  and  #  output  lines  of  all  the  i-cells,  its  interior  nodes 
are  the  i-cells  themselves,  and  (if  i  >  1)  its  leaves  are  the  (i  -  1) -cells 
together  with  the  input  x. . 

3.3.   Required  Hardware 

It  has  been  mentioned,  of  course,  that  the  cells  of  Method  3  are 
identical  to  those  used  by  Method  1.  We  now  proceed  to  show  that  corres- 
ponding Method  1  and  Method  3  networks  contain  exactly  the  same  number  of 
cells,  as  well. 

Theorem  1 

Each  stage  of  an  (n,k)  Method  3  network  contains  exactly  n  cells. 

Proof 

By  induction.   We  first  show  the  theorem  is  true  for  stage  1.   In 
step  Al  of  the  Method  3  algorithm,  we  construct  a  full  binary  tree 
fanning  in  n  inputs,  where  n  is  a  power  of  2.   This,  of  course,  requires 
n  -  1  cells.   In  Step  A2  we  complete  stage  1  by  installing  one  additional 
cell.   Thus  stage  1  has  exactly  n  cells. 

Now  suppose  the  theorem  is  true  for  stage  i  -  1.  Let  us  consider 
stage  i.   By  Remark  5,  the  i-cells  are  the  interior  nodes  of  a  regular 
binary  tree  whose  leaves  are  the  (i  -  l)-cells  plus  input  x. .   Since  by 
hypothesis  there  are  n  (i  -  l)-cells,  this  tree  has  a  total  of  n  +  1 
leaves.   Thus,  by  a  well-known  result  of  graph  theory,  the  tree  must  have 
n  interior  nodes.   That  is,  there  are  exactly  n  i-cells,  whence 
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the  theorem  is  true  for  stage  i. 

QED 

Since  an  (n,k)  Method  3  network  has  k  stages,  by  Theorem  1  the  total 
number  of  cells  it  contains  is  nk,  precisely  the  same  as  for  Method  1. 
Thus,  in  contrast  to  Method  2,  Method  3  requires  no  more  hardware  than 
the  original  rectangular  array.   Moreover,  it  is  evident  from  the  algorithm 
that  in  a  Method  3  network  each  cell  output  is  connected  to  (at  most)  one 
cell  input.   This  is  also  the  case  in  a  Method  1  array.   Thus  neither 
method  has  a  fan-out  problem,  which  again  is  different  from  Method  2. 

We  therefore  conclude  that  the  Method  3  transformation  of  a  rectangular 
network  amounts  to  nothing  more  than  a  rather  limited  rearrangement  of  the 
interconnections  among  the  nk  cells.   That  this  usually  results  in  a 
significant  increase  in  speed  (see  Chapter  5)  is  indeed  remarkable. 


37 


CHAPTER  4 
METHOD  3:   APPLICATIONS 

Having  just  boasted  the  merits  of  Method  3,  we  must  again  remind  the 
reader  that  unfortunately  it  cannot  always  be  used,  for  often  the 
transformation  is  not  output-preserving.   However,  in  this  chapter  we 
shall  present  several  cases  for  which  Method  3  does  indeed  work.   These 
are  intended  as  illustrative  examples  only;  our  array  methods  are  not 
necessarily  the  most  practical  or  efficient  circuits  available  to  perform 
the  given  computations.   We  shall  also  look  at  some  cases  where  Method  3 
does  not  work  and  then  speculate  as  to  which  properties  of  the  cell 
operations  are  relevant  to  output-preservation. 

4.1.   The  Augmented  Parallel  Counter 

In  this  example  the  interconnections  of  the  Method  1  array  (Figure  1) 
are  single  bit  lines,  and  the  cell  operations  (Figure  2)  are  as  follows: 
#  is  EXCLUSIVE  OR  (XOR)  and  *  is  logical  AND.   Each  cell  is  thus  a  half 
adder  (see  Figure  14). 

To  understand  what  a  rectangular  network  of  half  adders  does,  let  us 
look,  for  instance,  at  Figure  9.   Suppose  first  that  x  =  x„  =  x~  =  0. 
Then  y1   is  just  the  EXCLUSIVE  OR  sum  of  the  inputs   a..,  a« ,  .  .  .,  ag. 
The  carries  from  this  summation  are  then  XOR'ed  together  to  form  y~  ,  and 
finally  y„  is  the  EXCLUSIVE  OR  of  the  carries  from  the  y?   summation. 
Thus  y~y~y,  is  simply  the  ordinary  sum,  in  weighted  binary  form,  of  the 
eight  inputs  a..   In  other  words,  the  three-place  binary  numeral  yoYoYi 
represents  the  total  number  of  a  inputs  which  are  in  the  logic  ONE  state. 
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Hence,  the  network  acts  as  an  eight-input  parallel  counter. 

However,  the  three-bit  output  can  represent  only  8  of  the  9  possible 
a.  sums.   That  is,  y-^y-i  is  actually  the  sum  of  the  a,  's  modulo  8. 
Moreover,  dropping  our  assumption  that  the  x  's  are  zero,  we  see  that 
the  array  output  in  fact  represents  the  modulo  8  sum  of  the  weighted 
binary  number  x-x^x..  and  the  number  of  active  (logic  ONE)  a.  inputs. 

Thus  an  (n,k)  Method  1  network  of  half  adders  is  a  kind  of  generalized 
parallel  counter.   Its  k-bit  output  represents,  in  binary  coded  form  and 
modulo  2  ,  the  total  number  of  the  n  top  inputs  which  are  active,  plus  the 
weighted  binary  number  formed  by  the  k  right-hand  inputs.   We  shall  call 
such  a  circuit  an  "augmented  parallel  counter".   If  k  =  log  n  +  1  and 
the  x.  are  all  zero,  then  the  array  becomes  an  ordinary  n-input  parallel 
counter  —  it  simply  computes  the  number  of  a  's  which  are  ONE. 

Now  to  see  what  happens  when  the  Method  3  transformation  is  applied  to 
an  augmented  parallel  counter,  let  us  return  to  our  previous  example,  the 
(8,3)  network  (Figure  9).   Recalling  that  #  is  XOR,  we  see  from  Figure  10 
that  z- ,  the  first  Method  3  output,  is  the  XOR  of  x..  and  all  the  a.'s. 
That  is,  z.  is  the  least  significant  bit  of  the  sum  of  the  eight  a.'s  and 
the  binary  number  x-x-x. .   Since  *  is  AND,  Figure  11  shows  that  z„  is  the 
XOR  of  x_  and  all  the  carries  from  the  z   summation.   Hence,  z„  is  the 
next-to-least  significant  bit  of  the  sum  of  x_x«x1  and  all  the  a.'s. 
Similarly,  z~  is  the  next  most  significant  bit  of  this  sum  (see  Figure  12). 
Thus  we  see  that  z  z.z  =  y~y9y,  ;   that  is,  the  Method  3  network  performs 
exactly  the  same  computation  as  does  the  original  Method  1  array. 

Clearly,  this  argument  can  be  generalized  to  any  (n,k)  augmented 
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parallel  counter.   The  Method  1  network  and  the  Method  3  network  will 
always  compute  the  same  sum,  column  by  column.   The  two  methods  are 
dissimilar  only  in  that  for  any  one  output  bit  they  produce  the  sum  in  a 
different  order,  the  Method  3  arrangement  being  faster.   We  therefore 
conclude  that  when  #  is  XOR  and  *  is  AND,  the  Method  3  transformation  is 
output-preserving. 

4.2.   The  Partial  Sorter 

Unlike  the  previous  example,  in  which  the  cell  operations  were 
Boolean  functions,  many  of  the  other  cases  we  shall  look  at  feature  cell 
functions  which  operate  on  integer  values.   We  will  then  assume  that  in 
our  arrays  each  interconnecting  line  is  capable  of  carrying  a  representa- 
tion of  any  positive  integer.   Naturally,  there  would  in  practice  be  a 
limit  to  the  size  of  the  integers,  but  this  restriction  has  no  bearing  on 
the  relevance  of  our  examples. 

For  our  next  special  case,  we  take  the  cell  operation  #  to  be  MAX  and 
the  operation  *  to  be  MIN.   Of  course,  MAX  returns  the  larger  of  its  two 
integer  operands,  while  MIN  returns  the  smaller. 

To  see  what  a  rectangular  array  of  such  cells  does,  let  us  look  once 
again  at  Figure  9.   The  network  inputs  are  now  positive  integers.   Clearly, 
y1  will  be  the  largest  of  the  9  numbers  L,  a.,  a.,  .  .  .  ,  aft.   The  other 
8  numbers  are  passed  down  to  the  second  row  of  the  array.   They  are  then 
compared  with  x„ ,  and  the  largest  of  these  9  integers  becomes  y» .   A 
similar  operation  determines  y„.   Hence,  we  see  that  the  network  outputs 
will  simply  be  the  3  largest  numbers  among  the  11  numbers  which  were 
inputs  to  the  array.   It  is  also  easy  to  see  that  if  the  x.'s  were 
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originally  in  order,  with  x-  largest  and  x_  smallest,  then  the  y.'s 


will  likewise  be  in  order. 

Of  course,  similar  reasoning  holds  for  any  (n,k)  Method  1  network 
composed  of  such  cells.   The  output  numbers  are  always  the  k  largest  of 
the  n  +  k  input  numbers.   Moreover,  if  the  right-hand  inputs  are  in 
increasing  order  from  the  bottom  to  the  top  of  the  array,  then  the  same 
will  be  true  of  the  outputs.   Thus  the  network  is  able  to  take  an  ordered 
list  of  length  k  and  an  unordered  list  of  length  n  and  merge  them  into  an 
ordered  list  of  length  k  composed  of  the  largest  elements  in  the  original 
two  lists.   Hence,  it  is  with  some  justification  that  we  call  such  a 
network  a  "partial  sorter". 

It  is  quite  easy  to  convince  oneself  that  in  this  case ,  too ,  Method  3 
is  output-preserving.   Look,  for  example,  at  Figure  12  and  recall  that 
for  each  i  the  //  (i.e.,  MAX)  outputs  of  the  i-cells  are  kept  within  stage 
i  to  eventually  produce  output  z  ,  while  the  *  (i.e.,  MIN)  outputs  are 
passed  on  to  stage  i  +  1.   Hence,  the  Method  3  network  operates  in  exactly 
the  same  way  as  does  the  Method  1  network,  producing  the  largest  k 
numbers  at  its  outputs.   Only  the  order  of  comparison  is  different. 

4.3.   Other  Special  Cases 

Having  presented  two  examples  in  detail,  we  shall  now  simply  give  a 
list  of  a  number  of  other  particular  cases  that  were  investigated.   They 
employ  as  cell  operations  a  wide  variety  of  two-variable  functions,  some 
Boolean  and  some  integral.   Several  of  these  —  XOR,  AND,  MAX,  and  MIN  — 
we  have  used  before;  another  one  of  them  is  just  the  usual  logical  OR. 
The  rest  require  a  brief  explanation. 
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There  are  three  new  Boolean  functions.   COIN,  short  for  "coincidence", 
returns  a  1  if  and  only  if  its  two  arguments  are  equal.   IMP,  which 
stands  for  "implies,"  has  value  1  for  all  cases  except  1  IMP  0.   Finally, 
FIR,  short  for  "first,"  simply  returns  its  first  argument;  i.e., 
x  FIR  y  =  x. 

We  shall  also  use  a  number  of  operations  defined  on  the  set  of 
positive  integers.   The  value  of  GCD  is  the  greatest  common  divisor  of 
its  two  operands.   Similarly,  LCM  is  the  least  common  multiple.   The 
function  ADD  is  just  ordinary  addition,  while  MULT  performs  ordinary  multi- 
plication.  Lastly,  AVG  returns  the  average  of  its  two  arguments;  strictly 
speaking,  this  must  be  defined  on  the  set  of  rationals,  since  the  average 
of  two  integers  is  not  always  an  integer. 

We  combined  these  functions  into  various  #/*  pairs  and  determined  for 
which  of  these  Method  3  is  output-preserving.   The  results  of  this  effort 
are  shown  in  Table  3.   For  the  moment  ignore  the  column  marked  "conjecture 
properties  violated";  this  will  be  explained  in  the  next  section. 

Two  of  the  sixteen  cases  have  already  been  described:  example  no.  1  is 
the  augmented  parallel  counter  and  example  no.  4  is  the  partial  sorter. 
Note  that  we  list  four  other  cases  for  which  Method  3  works.   For  these, 
the  proofs  of  output-preservation  will  be  omitted.   As  for  those  instances 
where  Method  3  fails,  we  merely  found  simple  counterexamples  by  plugging 
in  specific  inputs  and  computing  the  outputs  by  hand.   In  fact,  in  all 
cases  (4,3)  networks  sufficed  to  demonstrate  the  lack  of  output-preservation, 

4.4.   A  Conjecture 

We  say  that  the  cell  operation  *  "distributes  over"  the  cell  operation  # 
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B  =  Boolean 
—  operand  type   I  ■  (positive)  integer 
R  =  (positive)  rational 


01 

iH 

| 

tO 

01 

c 
opei 

// 

ell 
ations 

* 

■  output-preserving? 

conjecture  properties  violated 

1 

XOR 

AND 

B 

yes 

none 

2 

AND 

OR 

B 

yes 

none 

3 

COIN 

OR 

B 

yes 

none 

A 

MAX 

MIN 

I 

yes 

none 

5 

GCD 

LCM 

I 

yes 

none 

6 

GCD 

GCD 

I 

yes 

none 

7 

AND 

XOR 

B 

no 

not  distributive,  XOR  not  idempotent 

8 

COIN 

AND 

B 

no 

not  distributive 

9 

IMP 

OR 

B 

no 

IMP  not  associative,  not  commutative 

10 

OR 

FIR 

B 

no 

FIR  not  commutative 

11 

ADD 

MULT 

I 

no 

MULT  not  idempotent 

12 

ADD 

MIN 

I 

no 

not  distributive 

13 

MAX 

ADD 

I 

no 

ADD  not  idempotent 

1A 

ADD 

ADD 

I 

no 

not  distributive,  ADD  not  idempotent 

15 

MULT 

GCD 

I 

no 

not  distributive 

16 

MAX 

AVG 

R 

no 

AVG  not  associative 

Table  3.   Attempts  to  apply  Method  3. 
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if  a  *  (b  |  c)  -  (a  *  b)  #  (a  *  c)   for  all  a,  b,  c  in  the  set  S  of 
operands.   (If  *  is  not  commutative,  we  also  require  that  (b  //  c)  *  a  = 
(b  *  a)  #(c  *  a).  )   To  make  a  less  familiar  definition,  we  shall  say 
that  the  operation  *  is  "idempotent"  if  a  *  a  =  a  for  all  a  in  S. 

A  large  part  of  our  research  has  been  an  attempt  to  characterize 
those  rectangular  arrays  to  which  Method  3  can  be  successfully  applied. 
That  is,  we  have  tried  to  find  sets  of  conditions  necessary  and/or 
sufficient  for  the  Method  3  transformation  to  be  output-preserving. 
Experience  with  many  examples  as  well  as  consideration  of  several 
theoretical  aspects  have  led  us  to  propose  the  following: 

CONJECTURE 
Method  3  is  output-preserving  if  and  only  if  the  cell 
operations  #  and  *  have  the  following  properties: 

a.  //is  associative  and  commutative 

b.  *  is  associative,  commutative,  and  idempotent 

c.  *  distributes  over  # 

Although  this  statement  has  stubbornly  resisted  all  attempts  to 
prove  it,  no  counterexamples  have  yet  been  found.   The  rightmost 
column  in  Table  3  shows  for  each  case  those  properties  listed  in  the 
conjecture  which  are  not  satisfied  by  the  given  pair  of  cell  operations. 
Any  property  not  mentioned  was  proved  to  hold.   The  examples  marked  with 
"none"  are  those  which  satisfy  all  the  conditions  of  the  conjecture.   Note 
that  these  are  precisely  the  cases  for  which  Method  3  is  output-preserving, 
The  examples  where  Method  3  fails  are  those  which  violate  one  or  more  of 
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the  conjecture  properties.  Hence,  Table  3  is  consistent  with  our  belief 
that  these  properties  are  both  necessary  and  sufficient . 

Since  the  Method  3  algorithm  pays  no  attention  to  the  order  in 
which  inputs  are  combined,  it  is  not  surprising  that  in  order  for  the 
outputs  to  be  the  same  as  in  Method  1,  both  cell  operations  must  be 
associative  and  commutative.   (In  fact,  this  has  been  proved  in  the  case 
of  the  #  operation.)  However,  the  need  for  distributivity  and,  especially, 
for  idempotence  is  much  less  obvious. 

Note  that  Table  3  concerns  arrays  of  arbitrary  dimensions  in  which 
the  cell  operations  are  specified.  While  trying  to  prove  the  necessity 
of  the  conjecture  conditions,  we  investigated  networks  with  arbitrary  cell 
functions  but  where  n  and  k  were  small  enough  to  conveniently  handle. 
In  manual  computations  with  these  networks,  we  needed  to  utilize  all  of 
the  conjecture  properties  in  order  to  get  the  outputs  of  Methods  1  and  3 
to  be  equal.  Nevertheless,  to  actually  prove  necessity  seemed 
impossible.   Perhaps  one  can  somehow  construct  rather  artificial  #  and  * 
operations  which  violate  the  conjecture  conditions  but  for  which  Method  3 
works.   However,  judging  from  our  research  it  appears  doubtful  that  such 
a  case  would  arise  naturally,  that  is  to  say,  when  using  any  standard 
sort  of  functions  for  cell  operations. 

The  situation  is  similar  with  regard  to  sufficiency.  When  working 
with  networks  of  feasibly  small  dimensions,  we  found  the  conjecture 
properties  sufficient  to  obtain  output  equality.   A  general  proof, 
though,  has  remained  elusive.   However,  as  an  indication  that  the  conjecture 
conditions  could  well  be  sufficient,  let  us  point  out  that  they  form 
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a  more  powerful  set  of  postulates  than  may  be  evident  at  first  glance. 
For  example,  using  only  those  properties  we  can  derive  the  following 
formula  (for  all  a,  b  in  S) : 

(a   //   b)    *    (a   //   b)    =    [(a   //  b)    *   a]    //    [(a   #  b)    *   b] 
a  //  b     =    [a  *    (a   //  b)]    //    [b   *    (a  //  b)] 

=  (a  *  a)  #  (a  *  b)  #  (b  *  a)  //  (b  *  b) 
=  a  #  (a  *  b)  #  (a  *  b)  #  b 
a  //  b  =  a  //  b  #  (a  *  b)  #  (a  *  b) 
Moreover,  letting  a  =  b  and  once  again  using  idempotence,  we  obtain: 

a  //  a  =  a  #  a  #  a  #  a  (1) 

Now,  it  may  perhaps  be  possible  to  invent  a  pair  of  functions 
satisfying  the  conjecture  conditions  but  for  which  Method  3  is  not 
output-preserving;  however,  once  again  we  consider  it  very  unlikely  that 
one  would  run  into  such  a  counterexample  by  accident.   For  one  thing,  any 
system  which  has  all  the  conjecture  properties  must  also  satisfy  (1) 
above.   Hence,  quite  probably  it  possesses  as  well  one  of  these  two 
common  properties  (for  all  a  in  S): 

Property  1     a  //  a  =  0,  where  0  is  an  identity  element  for  the 

#  operation 
Property  2     a  #  a  =  a   (i.e.,  #  is  idempotent) 

Indeed,  it  is  easy  to  check  that  one  of  these  holds  for  every  example  in 
Table  3  which  satisfies  all  the  conjecture  conditions.  (Obviously,  both 
properties  cannot  hold  at  once,  unless  0  is  the  only  element  in  S.) 
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Now  the  point  is  that  either  of  these  two  properties ,  when  added  to  the 
conjecture  conditions,  makes  the  two-operation  system  even  more  rich  in 
structure,  allowing  equations  with  #  and  *  to  be  greatly  simplified. 
This  seems  to  enhance  the  likelihood  that  corresponding  Method  1  and 
Method  3  outputs  are  indeed  equal. 

Taking  all  of  this  into  account,  we  come  to  the  following  conclusions. 
In  the  first  place,  our  conjecture  is  very  plausible.  Moreover,  even  if 
it  is  not  literally  true,  the  conjecture  appears  at  any  rate  to  be  a 
highly  reliable  test  for  determining  to  which  arrays  Method  3  can  be 
applied.   Because  the  conjecture  properties  are  easy  to  check,  this  provides 
a  quite  practical  method  for  screening  prospective  function  pairs.   Of 
course,  to  be  completely  safe,  one  must  supply  for  each  individual  case 
an  independent  proof  or  disproof  of  the  output-preservation  property. 
Thus  we  hope  that  future  research  will  lead  to  a  proof  of  the  conjecture, 
a  proof  of  some  modification  of  it,  or  at  least  some  enlightening 
counterexamples . 

4.5.   Additional  Remarks 

Note  that  all  of  the  conjecture  properties  hold  in  any  Boolean 
ring  as  well  as  in  any  Boolean  algebra.   (For  definitions  and  an 

excellent  discussion  of  these  two  types  of  systems,  see  the  book  by 

14 
Burton.   )   Thus  we  would  expect  Method  3  to  be  output-preserving 

for  both  kinds  of  systems.   This  is  illustrated  by  the  first  two  examples 

in  Table  3:   XOR/AND  is  a  Boolean  ring,  while  AND/OR  is  a  Boolean  algebra. 

On  the  other  hand,  MAX/MIN,  for  which  Method  3  also  works,  does  not  belong 

to  either  class  of  systems.  Finally,  it  is  interesting  to  note  that 
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Boolean  rings  satisfy  Property  1,  whereas  Boolean  algebras  satisfy 
Property  2. 

If  the  conjecture  is  true,  then  the  range  of  applications  for  Method  3 
may  be  rather  limited,  for  it  seems  that  the  conjecture  conditions  are 
really  quite  restrictive.   For  example,  among  the  16  two-variable 
switching  functions,  only  AND  and  OR  are  both  commutative  and  idempotent. 
Moreover,  on  the  domain  of  positive  integers,  the  only  operations  we 
could  readily  find  satisfying  all  three  conditions  on  the  *  function  were 
MAX,  MIN,  LCM,  and  GCD.   Note  that  these  four  functions  are  essentially 
the  same!   (For  instance,  to  find  the  least  common  multiple  of  two 
numbers,  we  simply  choose  the  maximum  exponent  for  each  prime  factor; 
GCD  and  MIN  are  likewise  related.) 

Clearly,  idempotence  is  the  most  difficult  condition  to  meet. 
The  other  conjecture  properties  hold  in  any  commutative  ring,  a  very  common 
type  of  system.   Hence,  to  find  applications  for  Method  3  we  might  be 
tempted  to  search  among  ordinary  commutative  rings,  hoping  that  the 
conjecture's  requirement  for  idempotence  is  not  absolute.   Unfortunately, 
the  following  proposition  has  been  proved  (we  omit  the  argument,  which  is 
very  tedious) :   If  the  cell  operations  #  and  *  form  a  commutative  ring 
(with  identity),  where  #  is  addition  and  *  is  multiplication,  and  if  for 
this  system  Method  3  is  output-preserving,  then  the  operation  *  is  idem- 
potent,  that  is,  the  system  is  actually  a  Boolean  ring. 

This  result  is  quite  significant  in  view  of  the  fact  that  there  are 
not,  so  to  speak,  many  Boolean  rings  around.   Indeed,  according  to  the 
famous  Stone  Representation  Theorem,  every  Boolean  ring  is  isomorphic  to 
a  ring  of  sets!   (In  a  ring  of  sets,  addition  is  the  symmetric  difference 
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14 
operation  and  multiplication  is  set  intersection.   See  Burton   for 


details.)  On  the  other  hand,  in  order  for  Method  3  to  work  it  is  not 
necessary  that  the  #/*  system  be  a  ring  at  all;  MAX/MIN  is  a  case  in 
point.   Hence,  the  true  extent  to  which  Method  3  can  be  applied  is  still 
an  open  question. 
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CHAPTER  5 
METHOD  3:   COMPUTATION  TIME 


We  come  now  to  the  crucial  question:   just  how  fast  does  Method  3 
perform?   Let  T„  be  the  propagation  time  in  unit  delays  for  an  (n,k) 
Method  3  network.   Usually  we  shall  consider  n  fixed  and  observe  how  T„ 
varies  as  k  runs  from  1  to  n.   Unfortunately,  an  exact  formula  for  T~  in 
terms  of  n  and  k  remains  to  be  discovered.   However,  a  combination  of 
theoretical  and  empirical  research  has  revealed  a  great  deal  about  the 
behavior  of  the  function  T~,  including  simple  and  accurate  approximations 
valid  at  least  for  the  range  n  <  1024.   Moreover,  we  shall  show  that  T„ 
is  in  general  much  smaller  than  both  T-  and  T~ ,  the  propagation  delays  for 
Methods  1  and  2,  respectively. 

5.1.   Comparison  with  Method  1 

Consider  an  (n,k)  Method  3  network  and  let  h  denote  the  height  of 
stage  i  (as  defined  in  Section  3.1).   Note  that  by  Step  A  of  the  Method 
3  algorithm  (Section  3.2),  h  =  log  n  +  1. 

Now  consider  stage  k,  the  final  stage  in  the  network.  3y  Remark  1 

(Section  3.2),  there  are  k  -  1  levels  above  stage  k.   Hence,  by  Remark  4, 

the  total  number  of  levels  in  the  network  is  k  -  1  +  h.  .   We  thus  have: 

k 

Remark  6      T„  =  h.  +  k  -  1 
3     K 

Hence,  surprisingly,  the  computation  time  of  the  Method  3  array 
depends  solely  on  the  height  of  its  last  stage.   Unfortunately,  it  turns 
out  that  h,  is  in  general  difficult  to  determine.   Much  can  be  gained, 
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however,  by  studying  the  properties  of  the  sequence  h  ,  h„ ,  .  .  . ,  h  . 
In  fact,  we  shall  now  present  a  few  lemmas  on  the  subject,  which  lead 
ultimately  to  a  theorem  comparing  the  speeds  of  Methods  1  and  3. 


Lemma  1 


h±  <  n   (for  all  i,  1  <  i  <  k) 


Proof 
By  Theorem  1  (Section  3.3),  stage  i  contains  n  cells.  Hence, 
noting  Remark  2  (Section  3.2),  it  is  obvious  that  the  height  of  stage  i 
can  be  no  more  than  n. 

QED 


Lemma  2  h.  £  h         (1  <  i  <  k  -  1) 


Proof 
By  Remark  1,  stage  i  +  1  begins  one  level  below  the  top  level  of 
stage  i;   by  Remark  3,  it  ends  at  least  one  level  below  the  bottom  level 
of  stage  i.   Thus  the  height  of  stage  i  +  1  must  be  at  least  as  great  as 
the  height  of  stage  i. 

QED 


Lemma  3  h  =  b.   <=>  h±  =  n     (1  <  i  <  k  -  1) 


Proof 


If  h.  -  n,  then  by  Lemma  2  h.  ..  >   n.   On  the  other  hand,  by  Lemma  1 
h    <  n.   Hence,  h.,-  =  n  =  h . . 
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Suppose  h.  -  h   •   Then  since  (by  Remark  1)  the  top  level  of  stage 
i  +  1  is  one  level  below  the  top  level  of  stage  i,  it  follows  that  the 
bottom  level  of  stage  i  +  1  is  one  level  below  the  bottom  level  of  stage 
i.   Let  L  be  the  bottom  level  of  stage  i  +  1.   By  Remark  3,  level  L 
contains   exactly  one  (i  +  l)-cell;  call  it  C  . 

Ju 

Since  level  L  -  1  is  the  bottom  level  of  stage  i,  it  contains  an 
i-cell,  say  c   1 ;  moreover,  (by  Remark  2)  it  contains  at  least  one 

Li — 1 

(i  +  l)-cell,  say  C    .   Now  according  to  the  Method  3  algorithm,  the  * 

i-i  x 

output  of  c   ,  and  the  #  output  of  C    must  each  be  connected  to  some 
(i  +  l)-cell  below  level  L  -  1.   But  the  only  (i  +  l)-cells  below  level 
L  -  1  is  C    Hence,  one  input  of  C   is  attached  to  c    and  the  other 

ij  •  Li  Li      L 

to  C_  -  .    This  means  that  there  can  be  no  (i  +  l)-cell  besides 

C    on  level  L  -  1,  for  there  are  no  available  (i  +  l)-cells  below  that 

level  to  accept  their  #  outputs. 

Now  consider  level  L  -  2.   It  contains  at  least  one  i-cell  c_  «  and  at 
least  one  (i  +  l)-cell  C  „ .   Since  C  -  is  the  only  available  (i  +  1)- 


cell  below  level  L  -  2,  its  two  inputs  must  be  connected  to  c   _ 


and 


C   _.   Hence,  there  can  be  no  (i  +  l)-cells  besides  C    on  level  L  -  2. 

L—Z  Li—A 

Clearly,  this  argument  can  be  repeated  for  levels  L  -  3,  L  -  4,  etc., 
all  the  way  to  the  top  level  of  stage  i  +  1.   The  result  is  that  every 
level  of  stage  i  +  1  has  only  one  (i  +  l)-cell.   Since  (by  Theorem  1) 
the  stage  has  n  cells,  its  height  must  be  n;  that  is,  h    =  n.   But 
then  since  h.  =  h. . - .  we  get  h.  =  n. 

x     1+1  1 

QED 
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Recalling  that  h-  *  log  n  +  1  and  taking  into  account  Lemmas  1 
through  3,  we  see  that  the  sequence  {h . }  starts  at  log  n  +  1,  strictly 

increases  until  it  reaches  n,  and  remains  at  n  thereafter.   Let  s  be 

'  n 

the  smallest  value  of  i  for  which  h  =  n.   We  can  then  list  the  basic 
properties  of  the  sequence  {h. }  as  follows: 

Property  A  h  -   log  n  +  1 

Property  B  {h . }  strictly  increases  for  1  £  i  <  s 

Property  C  h.  =  n  for  i  £  s 


Note  that  the  height  h.  of  a  stage  is  a  kind  of  inverse  measure  of 
its  parallelism.   Since  (by  Theorem  1)  each  stage  has  exactly  n  cells, 
the  greater  the  height  of  a  stage,  the  less  parallel  is  its  computation. 
Stage  1,  with  height  log  n  +  1,  is  as  parallel  as  it  can  be.   By 
Property  B,  later  stages  gradually  become  less  parallel.   Beyond  some 
point  (s  by  Property  C),  each  stage  has  height  n,  i.e.,  its  computation 
is  completely  serial. 

We  have  now  assembled  enough  information  to  easily  prove  our  theorem. 
Recall  that  for  an  (n,k)  Method  1  network,  the  computation  time  T  = 
n  +  k  -  1.   T_  is  the  computation  time  of  the  corresponding  Method  3 
network. 
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Theorem  2 


a.   for  k  <  s  ,  T_  <  T, 
n   3    1 


b.   for  k  >  8  ,  T  »  T, 
-  n'   3    1 


Proof 
By  Remark  6,  T  -  h  +  k  -  1.   By  Property  C,  for  k  >  s  ,  h,  =  n, 
whence  T~=n+k-l=T,  which  proves  b.   By  Properties  B  and  C, 
for  k  <  s  ,   h,<  n,  whence  T<n  +  k-l  =  T-,  proving  a. 

QED 

Therefore,  in  all  cases  Method  3  is  at  least  as  fast  as  Method  1. 

Moreover,  s   is  a  sort  of  "crossover"  point  between  Methods  1  and  3, 

analogous  to  the  crossover  point  c  between  Methods  1  and  2 .   For  k  <  s  , 

n  n 

Method  3  outperforms  Method  1.   Note  that  since  T  -  T~  =  (n  +  k  -  1)  - 
(h,  +  k  -  1)  =  n  -  h,  ,  it  follows  from  Property  B  that  this  speed 
difference  strictly  decreases  as  k  runs  from  1  to  s  .   Beyond  s  ,  the  two 
methods  have  identical  speed. 

Recalling  how  s  was  defined,  we  can  rephrase  our  result  as  follows: 
If  every  stage  of  the  Method  3  network  has  some  amount  of  parallelism, 
then  the  network  outperforms  the  Method  1  array;  but  if  one  or  more  of 
the  later  stages  is  serial,  then  the  two  methods  have  equal  performance. 

Of  course,  Theorem  2  leaves  us  with  the  question  of  what  the  value 

of  s   is  (for  each  n).   We  would  also  like  to  know  more  about  how  T„  behaves 
n  3 

when  k  <  s  .   These  matters  will  be  addressed  in  the  following  sections, 
n 
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5.2.   Further  Theoretical  Results 

We  shall  shortly  present  theorems  which  provide  more  details  about 
the  speed  of  Method  3.   First,  however,  we  must  prove  two  additional 
lemmas  concerning  the  sequence  {h . }  of  stage  heights. 

Lemma  4 


for  1  <   i  <   s    ,    log  n+i<h.   <n-s     +i 
-     n  —     1  -  n 


Proof 
By  Property  A,  h  =  log  n  +  1.   Thus  by  Property  B  we  must  have 


h2  >  log  n  +  2 


h  >  log  n  +  3 


h   >  log  n  +  s 

s  -   °      n 
n 


This  proves  the  first  inequality. 

By  Property  C,  hg  =  n.   Thus  by  Property  B  we  must  have 
n 

h   .  <  n  -  1 

s  -1  - 
n 


hs  -2  *  n  "  2 
n 


h  =  h   ,   ..  <  n  -  (s  -1) 
1    s  -(s  -1)  -      n 
n   n 


56 


Hence,  hs   j   <  n  -  j  for  0  5  J  5  8  "  1«   To  change  subscripts,  let 
n~J  n 

i  ■  s  -1 .   Then  1  <  i  <  s  ,  and  our  result  becomes  hj  <  n  -  (s   -i)  ■ 
nJ  -   -  n  i-n 

n  -  s  +  i.   This  proves  the  second  inequality. 

QED 


Lemma  5 

Suppose  that  for  some  I<s,hT=»n-s  +  I.  Then  for  all  i  where 

-  n'   I       n 

I<i<s,h.t=n-s  +i. 
-   -  n'   i        n 

Proof 

If  h_  =  n  -  s  +1  for  some  I  <  s  ,  then  by  Property  B  we  must  have: 
I       n  -  n 

h_.,      >     n-s     +1  +  1 

1+1     -  n 

hT , _     >     n-s     +1+2 
1+2     -  n 


h  >     n  -  s     +  s 

s  n         n 

n 


That  is,  for  all  i  where  I<i<s,h.  >n-s  +i.   But  by  Lemma  4  we 
'  -  n*   i  -      n 


also  have  h.  <  n  -  s  +  i. 
i  -      n 


QED 


From  Lemma  4,  we  note  in  particular  that 


log  n+i  <  n-s  +i 

°  -       n 


log  n  <  n-s 
-       n 


s   <  n  -  log  n 
n  - 
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If  s  were  equal  or  very  close  to  this  upper  bound,  then  since  n  -  log  n 
is  in  general  close  to  n,  s  would  also  be  close  to  n.   By  Theorem  2 
(Section  5.1),  this  would  mean  that  Method  3  outperforms  Method  1  for 
most  k  between  1  and  n. 

Thus  in  some  sense  n  -  log  n  is  an  ideal  value  for  s  .   For  that 
reason  we  introduce  the  following  special  notation: 


s   =  n  -  log  n 
n  ° 

d   =  s  -  s 
n      n    n 

Our  hope  is  that  d  is  small  (relative  to  n) . 

In  preparation  for  the  upcoming  theorems,  we  must  make  another 
definition: 

T3  =  log  n  +  2k  -  1 

Moreover,  we  note  the  following  equality: 

Remark  7  T„+d   =n-s+2k-l 
3    n         n 

This  is  a  direct  consequence  of  the  above  definitions  and  will  be  used 
often  in  the  work  that  follows. 


Theorem  3 


for  l<k<s,        T0<T.<T0+d 
-n'  3-3-3         n 


Proof 
From  Lemma  4  we  have: 
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log  n  +  k  <   h,   <  n-s  +k 
k  -       n 


Adding  k  -  1,  we  get: 


logn  +  2k-l  <  h.  +  k  -  1  <  n-s  +  2k  -  1 

K  n 


But  then  using  Remarks  6  and  7,  we  obtain: 


T_  <  T.   <   T0  +  d 
3-3-3    n 


QED 


This  theorem  provides  another  reason  for  hoping  that  d  =  s  -  s 

6       n    n    n 

is  small;  for  if  it  is,  then  the  simple  function  T~  =  log  n  +  2k  -  1  is 

a  good  approximation  to  T„  over  the  range  1  <  k  <  s  .   Of  course,  when 

k  >  s  we  know  by  Theorem  2b  that  T~  =  T,  =  n  +  k  -  1.  Moreover,  d 
n  3    1  n 

small  means  that  s  =  n  -  log  n  is  an  accurate  estimate  for  s  .   Hence, 
n  n 

d  is  a  crucial  parameter.   If  it  is  small  compared  to  n,  then  to  a  good 

approximation  we  know  the  values  of  T_  for  all  k,  plus  we  know  (as 

mentioned  earlier)  that  T„  <  T..  for  all  k  except  those  very  close  to  n. 

We  postpone  to  the  next  section  our  evidence  that  d  is  indeed  quite  close 

to  zero. 

We  can  improve  our  results  even  more  by  introducing  another  parameter, 

Note  that  when  k  =  s  ,  we  have: 

n 

T„  =  T  (by  Theorem  2b) 

=  n  +  k  -  1 

=  n-k+2k-l 

=  n-s  +2k-l 
n 

=  T0  +  d         (by  Remark  7) 
J    n 
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Thus  we  can  define  e  to  be  the  smallest  value  of  k  for  which  T.  = 

n  3 

T0  +  d  .   Obviously,  e  <  s  . 
3    n  J '  n  -  n 


Theorem  4 


For  e  <  k  <  s  ,   T0=T0+d 
n  -   -  n     3      3    n 


Proof 
Using  Remarks  6  and  7,  we  find  that: 


T0  =  T„  +  d   <===>  h.  +  k  -  1  -  n-s  +  2k  -  1 
3      3    n         k  n 

<==>  h,   =  n  -  s  +  k 
k         n 

Hence,  by  the  definition  ofe,  he  =n-s  +e.   But  then  according 
*  '  n'    n       n    n 

to  Lemma  5,  when  e  <k<s,  h.  =  n  -  s  +k,  whence  (from  the  above) 
'       n  -   -  n    k       n 


TQ  =  T»  +  d  . 
3    3    n 


QED 


Thus  for  values  of  k  between  e  and  s  ,  T_  is  given  exactly  by  the 

n      n  j 

simple  expression  log  n  +  2k  -  1  +  d  .   Since  for  k>s  ,T„=n+k-l, 

only  for  k  <  e  is  a  formula  for  T0  not  known  (and  even  in  that  case  the 
n  3 

bounds  of  Theorem  3  apply).   Hence,  it  would  be  nice  if  e  were 
relatively  small.   In  the  next  section  we  shall  find  that  this  is 
apparently  the  case. 

Table  4  summarizes  our  results  concerning  T„.  The  same  information 
is  shown  pictorially  in  Figure  15.  This  graph  is  intended  to  illustrate 
the  general  theory  and  has  been  distorted  for  the  sake  of  clarity.  (That  is, 
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Figure  15.   Speed  comparison  between  Method  3  and  Method  1. 


62 

the  relative  magnitudes  of  the  various  parameters  are  not  as  shown  here.) 

We  consider  n  fixed  and  have  sketched  T..  and  T„  as  functions  of  k,  with 

k  running  from  1  to  n.   Note  that  T  ■  n  +  k  -  1  is  a  straight  line  with 

slope  1.   Since  T~  =  log  n  +  2k  -  1,  both  T_  and  T-  +  d   are  lines  of 
3  3      3    n 

slope  2.   (Of  course,  since  k  is  an  integer,  none  of  these  are  in  reality 
continuous  lines.   Moreover,  we  have  drawn  T»  as  a  smooth  curve  between 
1  and  e  only  for  the  sake  of  the  picture;  actually,  its  exact  behavior 
in  this  region  is  a  matter  of  uncertainty.) 

Taking  T„  +  d  as  an  approximation  for  T„,  we  can  compute  the  speed 
difference  between  Methods  1  and  3  as  follows: 

Tl  "  T3  *  Tl  "  (*3  +  dn  >  " 

(n  +  k  -  1)  -  (n  -  s  +  2k  -  1)  =  s  -  k 

n  n 

Assuming  s  is  close  to  n  and  that  k  is  not  too  small,  we  can  also 
n 

estimate  the  speedup: 

T        T 

_i  *     1  n  +  k  -  1       ~ 

T0     ,.       "   n  -  s  +  2k  -  1 

3     T-  +  d 
3    n 

2k      2U   k; 
As  a  final  note  on  Figure  15,  we  point  out  that  T„  crosses  the  line 

As 

T,  at  s  .   This  is  because  when  k  =  s  ,  T0  =  T-  (since  s  <  s  ) ,  and 
3n  n*   3    1        n-n' 

thus  we  get: 

n  -  log  n  =  k 

n  =  log  n  +  k 

n+k-1  =  logn+2k-l 
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T   =  T 

T   =  T 
l3  3 

From  this  it  follows  that  for  all  k  <  s  ,  T0  <  T«  <  T0  +  d  .   This  is  a 

-  n*   3-3-3    n 

slight  extension  of  Theorem  3. 

5.3.   Some  Empirical  Results 

The  results  of  the  last  section  are  useful  only  if  d  and  e  are 

J  n      n 

small  compared  to  n.   Unfortunately,  general  formulas  for  these  quantities 
are  not  known.   For  that  reason  we  wrote  a  computer  program  of  the  Method 
3  algorithm  which,  for  a  given  n,  calculates  T„  for  all  values  of  k 

between  1  and  that  n.   From  this  information,  d  and  e  are  easily 

'  n      n 

determined.   The  program  was  run  for  values  of  n  up  through  1024,  in  order 
to  get  an  idea  of  the  relative  magnitudes  of  the  various  Method  3  parameters 
and  to  see  how  they  vary  with  n. 

Our  results  are  presented  in  Table  5.    Note  that,  as  we  had  hoped, 
the  crucial  quantity  d  is  indeed  very  small  compared  to  n.   In  fact,  as  a 
fraction  of  n,  d  actually  decreases  as  n  approaches  1024.   Although  it  is 
dangerous  to  extrapolate  from  this  kind  of  data,  the  results  are  certainly 
encouraging. 

Similarly,  over  the  range  investigated  the  parameter  e  stays  fairly 
small;   as  a  fraction  of  n  it  fluctuates  within  an  interval  bounded  by  7 
and  11  percent.   This  behavior  is  somewhat  less  promising  than  that  of 
d  ,  but  it  is  good  enough  to  indicate  the  usefulness  of  Theorem  4, 
especially  since  s  seems  to  stay  so  close  to  n. 

This  brings  us  to  the  most  impressive  aspect  of  our  table.   Note  that 

g 

n  rises  steadily  as  n  increases.   Beyond  n  =  64,   this  ratio  is  greater 
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s 
than  90%.   This  is  significant  because  —  represents  that  fraction  of 

k's  range  over  which  Method  3  outperforms  Method  1.  If  our  data  is  at  all 
representative,  then  indeed  Method  3  is  a  significant  improvement  over  the 
use  of  a  rectangular  array. 

This  conclusion  is  rendered  even  more  compelling  by  Figures  16 
through  19.   These  graphs  show  the  complete  results  of  our  computer  simu- 
lations for  several  values  of  n;  T  >  T_ ,  T„,,  and  T_  have  been  plotted 
together  for  all  values  of  k  between  1  and  n.   (We  have  connected  the 
plotted  points  so  as  to  present  continuous  curves.)  Thus  from  actual 
data  we  can  directly  compare  the  speeds  of  all  the  methods  which  we  have 
discussed.   In  particular,  we  see  that  Method  3  is  superior  to  Method  1 
over  nearly  the  entire  range  of  k,  the  speed  difference  being  especially 

drastic  when  k  «  n. 

/\      ^ 
In  these  graphs  the  Method  3  bounds  T~  and  T„  +  d  are  shown  as 

3      3    n 

dashed  lines.   Note  that  T»  =  T„  +  d  everywhere  except  in  the  extremes  of 

3    3    n 

k's  range.   Of  course,  for  k  close  to  n,  T.  =  T. .   For  k  close  to  0,  the 
behavior  of  T„  is  somewhat  erratic  (though  of  course  within  the  given 
bounds).   We  observe  also  that  for  all  k  <  §  =  n  -  log  n,  T»  is  only 
slightly  greater  than  T„  =  log  n  +  2k  -  1. 

5.4.   Comparison  with  Method  2 

In  Figures  16  through  19  we  also  see  that  Method  3  is  much  faster 
than  Method  2.   In  fact,  this  difference  in  performance  becomes  more  marked 
for  the  larger  values  of  n.   For  any  given  n,  the  speed  difference  increases 
with  k.   But  note  that  even  when  k  is  so  small  that  T„  <  T. ,  we  still  have 
that  T3  <  T2. 
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Figure  16.   Speed  comparison  for  n  =  16. 
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Figure  17.   Speed  comparison  for  n  =  64. 
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Figure  18.   Speed  comparison  for  n  *  256. 
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2000- 


Figure  19.   Speed  comparison  for  n  =  1024. 
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These  results  are  not  surprising  in  view  of  our  approximations  for 
the  speeds  of  the  two  methods: 

T2  s  T2  «  k(log  n  +  1) 
T3  z     T3  -   log  n  +  2k  -  1 

With  n  fixed,  T_  and  T„  are  straight  lines  which  intersect  at  k  ■  1;  but 

/\  * 

whereas  T„  has  a  slope  of  log  n  +  1,  T_  has  a  slope  of  only  2.   In  reality, 

the  speed  difference  between  Methods  2  and  3  is  not  quite  as  great  as 

this  would  indicate,  since  T«  is  slightly  smaller  than  T?  and  T-  is  slightly 

larger  than  T_.   Nevertheless,  the  speedup  is  remarkably  great,  at  least 

over  the  range  of  n  for  which  we  have  data. 

Of  course,  the  discussion  above  applies  equally  well  to  a  comparison 

between  Method  2'  and  Method  3.   In  fact,  since  Method  2'  is  slower  than 

Method  2,  the  speedup  provided  by  Method  3  is  in  this  case  even  larger. 
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CHAPTER  6 
CONCLUSIONS 

We  present  here  a  brief  review  of  the  results  of  this  paper, 
comparing  the  various  techniques  we  have  introduced  for  transforming 
iterative  arrays.   The  numbers  n  and  k  are  the  dimensions  of  the 
original  rectangular  (Method  1)  network.   T.  refers  to  the  propagation 
time  in  unit  delays  of  Method  i.   Recall  that  T  =  n  +  k  -  1. 

6.1.  Methods  2  and  2f 


Method  2  has  a  delay  time  of  order  k  log  n.   It  is  faster  than 

Method  1  only  when  k  is  smaller  than  about  -z .   In  that  case,  the 

log  n 

speedup  of  Method  2  over  Method  1  is  given  by: 


1  z    i(-i_\ 

2  k  V  log  nj 


2 

(These  approximations  are  valid  for  n  at  least  as  large  as  1024  and 
probably  much  larger.)  Method  2  uses  more  hardware  than  Method  1  and 
requires  very  high  fan-out  ratings. 

Method  2'  is  a  variation  of  Method  2  which  necessitates  much  lower 
fan-out  values.   However,  it  uses  even  more  hardware  than  Method  2. 
Method  2f  is  also  slightly  slower  than  Method  2.   Both  of  these  methods 
can  be  used  to  transform  any  Method  1  array,  regardless  of  the  cell 
operations  #  and  *  (see  Figure  2),  except  that  #  must  be  associative. 

6.2.   Method  3 

It  appears  that  Method  3  can  be  applied  only  to  arrays  which  satisfy 
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fairly  restrictive  requirements.   We  conjecture  necessary  and  sufficient 
conditions  to  be  that  *  is  associative,  commutative,  idempotent,  and 
distributive  over  // ,  where  //  is  associative  and  commutative.   When 
Method  3  can  be  used,  it  is  at  least  as  fast  as  Method  1,  regardless 
of  the  values  of  n  and  k.   Although  an  exact  formula  for  T„  is  unknown, 
much  has  been  learned  about  the  behavior  of  this  function  —  see  Table  4 
and  Figure  15. 

In  practice,  Method  3  is  usually  considerably  faster  than  all  the 
other  methods  discussed.   Under  certain  assumptions,  we  find  that  Method 
3  has  a  delay  time  of  order  log  n  +  k,  as  well  as  speedups  given  by: 


T3     2  U  +  k; 


T      T 

2  -   2f  ~  1 

TfT"  ~       7f-  "      7T  log  n 

13        13         I 


These  results  are  borne  out  by  empirical  data  gathered  for  values  of  n  up 
to  1024.  For  larger  n,  we  cannot  be  certain  that  Method  3  performs  as  well, 
although  indications  are  promising. 

The  speed  advantage  of  Method  3  over  Method  1  is  obtained  at  no 
hardware  expense  whatsoever.   The  Method  3  transformation  is  merely  a 
rearrangement  of  the  lines  which  interconnect  the  cells  of  the  Method  1 
array.   Moreover,  Method  3  entails  no  extra  fan-out.  The  one  disadvantage 
is  that  Method  3  produces  an  interconnection  pattern  which  is  very 
irregular. 
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6.3.   Summary 

Our  results  are  summarized  more  concisely  in  Table  6.   For  Methods 
2  and  2',  "hardware  required"  means  the  number  of  //-modules  plus  the  number 
of  *-modules.   (Otherwise,  it  means  the  number  of  #/*-modules. )   For 
Methods  2  and  3,  the  speed  approximations  given  in  the  table  are  valid  for 
n  up  to  at  least  1024.  However,  the  best  way  to  appreciate  the  speed 
differences  among  all  of  these  methods  is  to  look  at  the  graphs  of 
actual  data  in  Figures  16  through  19. 

It  is  clear  that  where  Method  3  can  be  applied  it  is  vastly 
superior  to  the  other  methods  we  have  considered.   If  Method  3  cannot  be 
used,  then  Method  2  or  Method  21  may  be  the  best  choice,  but  only  if  k 
is  small  compared  to  n.   Otherwise,  we  shall  have  to  be  satisfied  with  the 
original  Method  1  array.   Future  research  should  be  directed  toward 
developing  transformation  techniques  with  benefits  at  least  as  great  as 
those  of  Methods  2,  2',  and  3  but  with  less  severe  limitations. 
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